Abstract This paper considers the challenge of evaluating a set of classifiers, as 
done in shared task evaluations like the KDD Cup or NIST TREC, without ex- 
pert labels. While expert labels provide the traditional cornerstone for evaluating 
statistical learners, limited or expensive access to experts represents a practical 
bottleneck. Instead, we seek methodology for estimating performance of the clas- 
sifiers (relative and absolute) which is more scalable than expert labeling yet pre- 
serves high correlation with evaluation based on expert labels. We consider both: 
1) using only labels automatically generated by the classifiers themselves (blind 
evaluation); and 2) using labels obtained via crowdsourcing. While crowdsourcing 
methods are lauded for scalability, using such data for evaluation raises serious 
concerns given the prevalence of label noise. In regard to blind evaluation, two 
broad strategies are investigated: combine & score and score & combine. Combine 
& Score methods infer a single "pseudo-gold" label set by aggregating classifier 
labels; classifiers are then evaluated based on this single pseudo-gold label set. On 
the other hand, score & combine methods: i) sample multiple label sets from clas- 
sifier outputs, ii) evaluate classifiers on each label set, and iii) average classifier 
performance across label sets. When additional crowd labels are also collected, 
we investigate two alternative avenues for exploiting them: 1) direct evaluation of 
classifiers; or 2) supervision of combine & score methods. To assess generality of 
our techniques, classifier performance is measured using four common classification 
metrics, with statistical significance tests establishing relative performance of the 
classifiers for each metric. Finally, we measure both score and rank correlations be- 
tween estimated classifier performance vs. actual performance according to expert 
judgments. Rigorous evaluation of classifiers from the TREC 2011 Crowdsourcing 
Track shows reliable evaluation can be achieved without reliance on expert labels. 
Fig. 1 Our experimental framework. As input, K binary classifiers each label M examples. 
(a) As ground truth, classifiers are scored for several metrics based on expert judgments, 
statistical significance of differences is computed, and classifiers are ranked (best to worst). 
(b) An estimation method p is used to predict classifier scores without expert judgments, and 
classifiers are ranked according to estimated scores (score differences which are not 
statistically significant yield tied rankings). Score and rank correlation is then measured 
between estimated vs. actual scores and ranks (c1). (b') A second, alternative method q is used 
to estimate classifier scores, classifiers are ranked accordingly, and correlation of scores and 
ranks vs. ground truth is measured (c2). Finally, we compare the correlations c1 and c2 to 
determine whether p or q achieved the greater score and rank correlation (with statistical 
significance). 



Keywords Evaluation • Performance prediction • Crowdsourcing • Label 
1 Introduction 



While expert labels provide the traditional cornerstone for evaluating statistical 
learners, limited or expensive access to experts represents a practical bottleneck. 
To be more specific, consider the field of Information Retrieval (IR), where expert 
labels in the form of relevance judgments provide the foundation for evaluating IR 
systems under the Cranfield paradigm [10]. Such labels enable evaluation of IR 
systems for both ranking and classification scenarios: ranking items in a stand- 
ing collection for ad hoc queries, or classifying new items as they arrive based on 
standing queries (e.g., RSS filtering). In order to accurately characterize system 
effectiveness, experience has shown that IR systems should be evaluated at the 
operational scale at which they will be used in practice. However, as collection sizes 
have grown rapidly in recent years, it has become increasingly infeasible to manually 
label so many examples using traditional expert labeling. In at least one case, 
insufficient labels have compromised a NIST TREC shared task evaluation [35]. 
As such, the IR community has become particularly interested in developing more 
scalable evaluation methodology. While statistical sampling techniques and robust 
evaluation metrics have significantly reduced the number of examples needing to 
be labeled for stable evaluation [2,6,9,27], expert labeling remains a significant 
bottleneck. Another strategy, inferring implicit labels from people's behavior us- 
ing the system [20], requires large user populations. New crowdsourcing methods 
offer potentially large time and cost savings vs. traditional expert labeling, but 
present new quality concerns [1,4,21,30]. 

We seek methodology for estimating performance of the classifiers (relative and 
absolute) which is more scalable than expert labeling yet preserves high correla- 
tion with evaluation based on expert labels. While the IR community has directed 
considerable attention toward estimating effectiveness of ranking algorithms, rel- 
atively little work has explored estimation of classifier performance. To this end, 
we investigate both: 1) using only labels automatically generated by the classifiers 
themselves (blind evaluation); and 2) using labels obtained via crowdsourcing. In 
regard to blind evaluation, two broad strategies are investigated: combine & score 
and score & combine. Combine & Score methods infer a single "pseudo-gold" la- 
bel set by aggregating classifier labels; classifiers are then evaluated based on this 
single pseudo-gold label set. This strategy builds on an active area of machine 
learning research developing methods to aggregate multiple judgments into a single 
consensus judgment set [13,26]. On the other hand, score & combine methods: 
i) sample multiple label sets from classifier outputs, ii) evaluate classifiers on each 
label set, and iii) average classifier performance across label sets. To further 
investigate the reliability-cost tradeoff of evaluation, we investigate two alternative 
avenues for exploiting additional labels collected via crowdsourcing: 1) direct 
evaluation of classifiers; or 2) supervision of combine & score methods. 

To assess generality of our techniques, classifier performance is measured using 
four common classification metrics, with statistical significance tests establishing 
relative performance of the classifiers for each metric. To evaluate our techniques, 
we measure both score and rank correlations between estimated vs. actual classifier 
performance (relative and absolute). 

Figure 1 depicts our experimental framework. Experiments are conducted with 
ten binary classifiers submitted to the TREC 2011 Crowdsourcing Track¹ to 
investigate the following research questions: 1) Can we reliably estimate classifier 
performance without expert judgments? 2) How much benefit do crowd judgments 
provide over blind evaluation? 3) Which methods provide the best score and/or 
rank correlation for each labeled data condition and classifier metric of interest? 4) 
How robust is evaluation based on combine & score methods to their labeling er- 
rors? 5) How effectively can we evaluate outlier classifiers without any judgments? 

Results show high score correlation for three of the four classifier metrics con- 
sidered. While crowd judgments are not seen to provide significant improvement for 
score correlation, they do significantly benefit rank correlation. When crowd judg- 
ments are available, we find direct evaluation on them outperforms using them to 
supervise combine & score methods. In the blind evaluation case, simple sampling- 
based evaluation is typically as effective as more complicated EM, but significantly 
outperforms the more popular MV approach. As expected, lower quality of labels 
output by combine & score does yield less accurate evaluation, though evaluation 
is reasonably tolerant of some amount of label noise. Finally, blind evaluation for 
outliers is surprisingly accurate, though use of crowd judgments will likely be more 
common in practice to achieve a more accurate ranking. 



1 https://sites.google.com/site/treccrowd 
2 Related Work 

A broad concern with human labeling, such as relevance judging, is ensuring label 
consistency such that systems can be effectively trained and evaluated. A decade 
ago, Voorhees showed that reliable evaluation of search systems could be achieved 
despite significant variations in the underlying relevance judgments [Mj. Soboroff 
et al. [31] took this idea further: could we forgo human labeling entirely (i.e., 
blind evaluation) by sampling relevant documents randomly after pooling outputs 
from all systems participating in a shared task evaluation? While such evaluation 
indeed correlated positively and significantly with use of human judgments, pre- 
dicted performance was less reliable for the best systems. Aslam and Savell [3] 
achieved comparable correlation by simply scoring each system by its mean Jac- 
card Coefficient over the set of retrieved documents vs. those retrieved by each 
other system. Wu and Crestani exploited similar "reference count" popularity, 
modeling expected correlation between relevance and the rank and frequency at 
which documents are retrieved by systems [36], similar to rank fusion [11]. Several 
other blind evaluation methods have also been explored [15,18,28,32]. 

Lam and Stork [23] investigated correlation between label noise and classifier 
error. Büttcher et al. [7] investigated classifier evaluation with biased labels. 
Cormack et al. [12] used a pseudo-ground truth generated by spam filters to evaluate 
the filters, and demonstrated a lower error rate compared with labels obtained 
from natural sources (e.g., user labels and exhaustive adjudication by experts). In 
comparison to these prior works, our combine & score methods approximate the 
pseudo-ground truth in more general fashion via aggregating classifier outputs for 
consensus. Cormack et al. [12] approximate the pseudo-gold labels by comparing 
the relative performance of each pair of spam filters over some predefined measures. 
This approach is not directly applicable to generating consensus labels in a general 
way, since it requires additional effort to compare each classifier's performance. 
Moreover, we investigate the benefit of using crowdsourced labels for the supervision 
of classifiers and the effect of direct evaluation of classifiers with them. 



3 Estimation Methods 

This section describes various methods for estimating classifier performance using 
either no expert labels (blind evaluation) or labels via crowdsourcing. We organize 
methods into two general classes: score & combine and combine & score. 

Score & Combine methods resemble the sampling approach of Soboroff et 
al. [31] . Each classifier is evaluated by its average performance across some num- 
ber of pseudo-gold judgment sets sampled from classifier outputs (i.e. we score 
classifiers, then combine the scores). In contrast, combine & score methods first 
aggregate classifier outputs into a single judgment set, then evaluate classifiers on 
this consensus judgment set (i.e. we combine the labels, then score the classifiers). 
The combine & score approach follows a vein of machine learning research into 
how to effectively integrate redundant labels produced by multiple systems [13] or 
people [13] to yield consensus labels. This area has become particularly active in 
the context of crowdsourcing to aggregate noisy human labels [26] . 

Score & Combine estimation methods are always blind (i.e. performed without 
labeled data). Combine & Score methods span both unsupervised and supervised 
methods. For unsupervised methods, we consider Majority Vote (MV) and 
Expectation Maximization (EM). For supervised methods, we consider a variety of 
approaches: calibrated MV, Naive Bayes (NB), Support Vector Machine (SVM), 
Generalized Linear Model (GLM), and AdaBoost. We also consider simply using 
crowdsourced labels (in lieu of expert judgments) to directly evaluate classifiers. 

Notation. K input classifiers $c_{1:K}$ each label M examples $x_{1:M}$ with labels 
$l_{mk}$. Each example $x_m$ has true label $l(x_m) = C_m \in \{0, 1\}$ for binary 
classification. 



3.1 Blind Methods: Score & Combine 
3.1.1 Round-robin 

In round-robin evaluation with K classifiers, we select each classifier in turn and 
use its labels to evaluate all $K - 1$ other classifiers. Each classifier is then scored 
by its mean performance over all $K - 1$ trials. 
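The round-robin procedure can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`accuracy`, `round_robin_scores`) and the toy label sets are assumptions introduced here, and accuracy stands in for any of the metrics considered.

```python
from typing import Callable, List, Sequence

Labels = Sequence[int]
Metric = Callable[[Labels, Labels], float]


def accuracy(pred: Labels, gold: Labels) -> float:
    """Fraction of examples on which pred agrees with the reference labels."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def round_robin_scores(outputs: List[Labels], metric: Metric = accuracy) -> List[float]:
    """Score each classifier by its mean metric value over the K - 1 trials
    in which another classifier's labels serve as pseudo-gold."""
    k = len(outputs)
    scores = []
    for i in range(k):
        trials = [metric(outputs[i], outputs[j]) for j in range(k) if j != i]
        scores.append(sum(trials) / (k - 1))
    return scores


# Hypothetical toy input: K = 3 classifiers labeling M = 4 examples.
toy = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
print(round_robin_scores(toy))
```

On the toy input, the third classifier disagrees with both of the others on most examples and therefore receives the lowest estimated score.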



3.1.2 Sampling 

With three human assessors judging 50 topics, Voorhees [32] performed topic-level 
sampling to generate new judgment sets from the space of $3^{50}$ possible 
combinations. In our case, for each example we randomly select a classifier to label it; 
with K classifiers and n examples, we sample our pseudo-gold from $K^n$ possible 
labelings. Whereas round-robin estimation is based on a sample size of $K - 1$, sampling 
allows us to arbitrarily increase the sample size $L$ to reduce variance (generating 
$L/(K-1)$ times as many samples as round-robin evaluation). We set $L = 1000$. 
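A minimal sketch of sampling-based estimation is given below, assuming accuracy as the metric and a fixed random seed for repeatability. The names `sample_pseudo_gold` and `sampling_scores` and the toy inputs are illustrative assumptions, not taken from the original system.

```python
import random
from typing import List, Sequence

Labels = Sequence[int]


def accuracy(pred: Labels, gold: Labels) -> float:
    """Fraction of examples on which pred agrees with the reference labels."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def sample_pseudo_gold(outputs: List[Labels], rng: random.Random) -> List[int]:
    """For each example, copy the label of one classifier chosen uniformly at random."""
    k, m = len(outputs), len(outputs[0])
    return [outputs[rng.randrange(k)][i] for i in range(m)]


def sampling_scores(outputs: List[Labels], num_samples: int = 1000, seed: int = 0) -> List[float]:
    """Average each classifier's accuracy over num_samples sampled pseudo-gold sets."""
    rng = random.Random(seed)
    totals = [0.0] * len(outputs)
    for _ in range(num_samples):
        gold = sample_pseudo_gold(outputs, rng)
        for c, pred in enumerate(outputs):
            totals[c] += accuracy(pred, gold)
    return [t / num_samples for t in totals]


# Hypothetical toy input: K = 3 classifiers labeling 4 examples.
toy = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
print(sampling_scores(toy))
```

Because each sampled label may come from the classifier being scored, a classifier's expected score under this sketch is the average fraction of classifiers (itself included) that agree with it on each example.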



3.2 Blind Methods: Combine & Score 



3.2.1 Majority Vote 

Majority Vote (MV) is the simplest, best-known method for generating consensus: 
simply pick the label receiving the most votes. This is equivalent to computing the 
average label and rounding according to a decision threshold t: 

$$l_m^{MV} = 1 \;\text{ iff }\; \frac{1}{K} \sum_{k=1}^{K} l_{mk} \geq t \qquad (1)$$

where $t = \frac{1}{2}$ by default for unbiased rounding. With an even number of votes, 
ties are possible and broken randomly to avoid bias. More generally, the decision 
threshold may be varied to achieve a desired recall/specificity tradeoff (see 
Section 4.2 for definitions of classification metrics). Setting $t = 0$ labels all examples 
as relevant (100% recall), while $t > 1$ labels all as non-relevant (100% specificity). 
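The thresholded vote of Equation 1 can be sketched as below. The function name `majority_vote`, the seed parameter, and the choice to apply random tie-breaking only to an exact even split at the default threshold are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def majority_vote(outputs: List[Sequence[int]], t: float = 0.5, seed: int = 0) -> List[int]:
    """Label example m as 1 iff the mean of its K binary labels is at least t.
    At the default t = 0.5, an exact even split is broken randomly."""
    rng = random.Random(seed)
    k = len(outputs)
    consensus = []
    for labels in zip(*outputs):  # iterate over examples
        mean = sum(labels) / k
        if t == 0.5 and mean == 0.5:
            consensus.append(rng.randint(0, 1))  # tie: random label
        else:
            consensus.append(1 if mean >= t else 0)
    return consensus


# Hypothetical toy input: K = 3 classifiers labeling 4 examples.
toy = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
print(majority_vote(toy))          # default threshold
print(majority_vote(toy, t=0.0))   # everything relevant
print(majority_vote(toy, t=1.1))   # everything non-relevant
```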
3.2.2 EM 

Expectation Maximization (EM) [13] estimates the error rates of each classifier $c_k$ 
by a latent confusion matrix $[\pi_{ij}^{(k)}]$, where the ij-th element $\pi_{ij}^{(k)}$ 
denotes the probability of classifier $c_k$ classifying an example to class j given the 
true label is i, estimated based on each example's class membership as: 

$$\pi_{ij}^{(k)} = \frac{\sum_{m=1}^{M} T_{mi}\, n_{mj}^{(k)}}{\sum_{j'=1}^{C} \sum_{m=1}^{M} T_{mi}\, n_{mj'}^{(k)}} \qquad (2)$$

Indicator $n_{mj}^{(k)} = 1$ iff example $x_m$ receives label j from classifier $c_k$, and 
indicator $T_{mi} = 1$ iff i is the true label for $x_m$. Latent class prior $p_i$ is 
estimated by: 

$$p_i = \frac{1}{M} \sum_{m=1}^{M} T_{mi} \qquad (3)$$

Since the true label for $x_m$ is unknown in the unsupervised case, EM uses a 
mixture of multinomials to estimate classifier accuracy. Assuming every pair of 
classifiers is independent, the probabilistic model likelihood can be written: 

$$\prod_{m=1}^{M} \left( \sum_{i=1}^{C} p_i \prod_{k=1}^{K} \prod_{j=1}^{C} \left(\pi_{ij}^{(k)}\right)^{n_{mj}^{(k)}} \right) \qquad (4)$$

Estimating the maximum likelihood in Equation 4 is analytically intractable 
since it involves computing the product of a summation. However, once we get 
estimates for latent parameters $p_i$ and $\pi_{ij}^{(k)}$, we can derive new class 
membership for label $l_m$ such that $T_{mi} = 1$ if i becomes the estimated true label 
for example $x_m$, i.e., the class which maximizes: 

$$p_i \prod_{k=1}^{K} \prod_{j=1}^{C} \left(\pi_{ij}^{(k)}\right)^{n_{mj}^{(k)}}, \qquad m = 1, \ldots, M \qquad (5)$$

We then iteratively re-estimate latent $p_i$ and $\pi_{ij}^{(k)}$, and missing labels 
$T_{mi}$, from Equations 2, 3 and 5 until convergence. 
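The iteration over the class prior, confusion matrices, and class memberships can be sketched as a hard-assignment EM loop. Several details here are assumptions not stated in the text: memberships are initialized by plurality vote, a small additive smoothing constant keeps probabilities away from zero, and the function name `em_consensus` is hypothetical.

```python
import math
from typing import List, Sequence


def em_consensus(outputs: List[Sequence[int]], num_classes: int = 2,
                 max_iter: int = 50, smooth: float = 0.01) -> List[int]:
    """Hard-assignment EM: alternate between estimating the class prior and
    per-classifier confusion matrices, and re-assigning each example's label."""
    k, m = len(outputs), len(outputs[0])
    classes = range(num_classes)
    # Assumed initialisation: plurality vote (ties go to the lower class index).
    truth = [max(classes, key=lambda c: sum(out[x] == c for out in outputs))
             for x in range(m)]
    for _ in range(max_iter):
        # Class prior p_i from current memberships (smoothed).
        prior = [(truth.count(i) + smooth) / (m + num_classes * smooth) for i in classes]
        # Confusion matrix per classifier: conf[c][i][j] ~ P(label j | true class i).
        conf = []
        for out in outputs:
            mat = [[smooth] * num_classes for _ in classes]
            for x in range(m):
                mat[truth[x]][out[x]] += 1.0
            conf.append([[v / sum(row) for v in row] for row in mat])
        # Re-assign each example to the class with the largest log posterior.
        new_truth = []
        for x in range(m):
            log_post = [math.log(prior[i])
                        + sum(math.log(conf[c][i][outputs[c][x]]) for c in range(k))
                        for i in classes]
            new_truth.append(log_post.index(max(log_post)))
        if new_truth == truth:  # converged
            break
        truth = new_truth
    return truth


# Hypothetical toy input: three identical classifiers and one that inverts every label.
gold = [1, 0, 1, 0, 1, 0, 1, 1]
outputs = [list(gold), list(gold), list(gold), [1 - g for g in gold]]
print(em_consensus(outputs))
```

In the toy input, the inverted classifier is still informative: its confusion matrix is learned to be anti-diagonal, so its labels support the same consensus as the other three.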



3.3 Direct Evaluation on Crowd Labels 

When crowd labels are available, the simplest way to use them is to evaluate 
classifiers on them directly. As with combine & score approaches, the classifiers 
are scored against a single judgment set, though here judgments come from the 
crowd rather than from the classifiers. 



3.4 Supervised Combine & Score Methods 

Another way to use crowd judgments is to supervise combine & score methods. 
We investigate a variety of approaches: calibrated MV, Naive Bayes (NB), Support 
Vector Machine (SVM), Generalized Linear Model (GLM), and AdaBoost. 
Fig. 2 Majority Vote (MV) aggregates input classifier labels, then follows the Bayes optimal 
decision rule in deciding between classes. With training data, input bias can be detected to 
calibrate the decision threshold t. Agreement between MV and expert judgments (development 
set) is shown for the four classification metrics from Section 4.2. 



3.4.1 Calibrated Majority Vote 

When training labels are available (supervised setting), we can detect input bias 
and calibrate the decision threshold t for MV (see Section 3.2.1 for description of 
unsupervised MV). Figure 2 shows the impact of varying this decision threshold 
for the different classifier metrics considered. While unsupervised MV uses the 
default t = 0.5, in the supervised case we tune t to maximize metric performance. 
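Threshold calibration can be sketched as a simple search over candidate thresholds on the training labels. The candidate grid (every attainable vote fraction plus one value above 1), the use of accuracy as the tuning metric, and the names `mv_labels` and `calibrate_threshold` are assumptions made for this illustration.

```python
from typing import Callable, List, Sequence

Labels = Sequence[int]


def accuracy(pred: Labels, gold: Labels) -> float:
    """Fraction of examples on which pred agrees with the training labels."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def mv_labels(outputs: List[Labels], t: float) -> List[int]:
    """Thresholded vote: 1 iff the mean label is at least t."""
    k = len(outputs)
    return [1 if sum(labels) / k >= t else 0 for labels in zip(*outputs)]


def calibrate_threshold(train_outputs: List[Labels], train_gold: Labels,
                        metric: Callable[[Labels, Labels], float] = accuracy) -> float:
    """Return the threshold that maximizes the metric on the training labels.
    Candidates are 0, 1/K, ..., 1 and (K+1)/K (the latter labels everything 0)."""
    k = len(train_outputs)
    candidates = [v / k for v in range(k + 2)]
    return max(candidates, key=lambda t: metric(mv_labels(train_outputs, t), train_gold))


# Hypothetical toy input: classifiers biased toward the positive class.
train_outputs = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
train_gold = [1, 0, 0, 0]
print(calibrate_threshold(train_outputs, train_gold))
```

With positively biased inputs such as these, the tuned threshold moves above the default 0.5, so that only unanimous votes produce a positive consensus label.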



3.4.2 Naive Bayes 

Naive Bayes (NB) represents another approach which exploits supervision to try to 
infer better pseudo-gold [30]. We assume that the input labels $l_{m,1:K}$ for example 
$x_m$ from classifiers $c_{1:K}$ are conditionally independent given the true class. The 
posterior probability is calculated from the prior probability of each class 
$p(C = c)$ and the likelihood $p(l_{m,1:K} \mid C = c)$. Likelihood is computed on the 
labels of each classifier. Inference is computed by: 

$$\arg\max_{c} \; p(C_m = c) \prod_{k=1}^{K} p(l_{mk} \mid C_m = c) \qquad (6)$$
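The inference rule of Equation 6 can be sketched for binary labels as follows. Add-one smoothing and the function names `train_nb` and `predict_nb` are assumptions introduced here; the text does not specify how the probabilities were estimated.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], List[List[List[float]]]]


def train_nb(train_outputs: List[Sequence[int]], train_gold: Sequence[int],
             smooth: float = 1.0) -> Model:
    """Estimate the class prior and, per classifier, P(label | true class)."""
    m = len(train_gold)
    prior = [(list(train_gold).count(c) + smooth) / (m + 2 * smooth) for c in (0, 1)]
    like = []
    for out in train_outputs:
        table = [[smooth, smooth], [smooth, smooth]]  # table[true class][label]
        for label, c in zip(out, train_gold):
            table[c][label] += 1.0
        like.append([[v / sum(row) for v in row] for row in table])
    return prior, like


def predict_nb(model: Model, labels: Sequence[int]) -> int:
    """Return the class maximizing prior times the product of per-classifier likelihoods."""
    prior, like = model
    log_post = [math.log(prior[c])
                + sum(math.log(like[k][c][l]) for k, l in enumerate(labels))
                for c in (0, 1)]
    return log_post.index(max(log_post))


# Hypothetical toy input: one accurate classifier and one that inverts labels.
train_gold = [1, 1, 0, 0]
train_outputs = [[1, 1, 0, 0], [0, 0, 1, 1]]
model = train_nb(train_outputs, train_gold)
print(predict_nb(model, [1, 0]), predict_nb(model, [0, 1]))
```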



3.4.3 Generalized Linear Model 



The generalized linear model (GLM) extends ordinary linear regression [2l]. For 
consensus label estimation, the GLM fits the given set of classifier output labels 
$l_{m,1:K}$ with training label $C_m$ for example m. It returns a K-1 vector $\beta$ of 
coefficient estimates for a generalized linear regression of the responses in $x_m$ for 
classifier labels $l_{m,1:K}$, using the binomial distribution. We use the logit link 
function $\ln\left(\frac{\mu}{1-\mu}\right)$. 
ID   Source       Use          Examples   Labels Per Ex. 
D1   Classifier   Input        16,758     10 
D2   Crowd        Train        1,906      1 
D3   Waterloo     Validation   1,000      1 
D4   NIST         Test         1,000      1 

Table 1 Experiments are performed using four distinct sets of binary labels (D1-D4) drawn 
from the 2011 TREC Crowdsourcing Track. 



3.4.4 SVM 

A Support Vector Machine (SVM) learns a hyper-plane decision surface between 
classes, defined by learned "support vectors" which maximize the separation margin 
between example classes observed in training data [5]. We adopt a simple linear 
kernel. The training function identifies support vectors $s_i$, weights $\alpha_i$, and bias b 
that are used to classify vectors x according to the following equation: 

$$g_m(x) = \sum_{i} \alpha_i \, k(s_i, x) + b \qquad (7)$$

where k is a kernel function, here the dot product. If $g_m(x) > 0$, then example m 
is classified as relevant (1), otherwise it is classified as non-relevant (0). 

3.4.5 AdaBoost 

Boosting is a meta-algorithm which iteratively learns weak classifiers with respect 
to a distribution over the training data and adds them to a final strong classifier in 
a weighted way. It re-weights the data at each iteration until convergence. In this 
work, we adopt the well-known AdaBoost algorithm [17]. 

4 Evaluation Methodology 

This section describes how we evaluate alternative methods for estimating clas- 
sifier performance. We begin by describing the datasets used in our study. Next, 
we define the classifier metrics used. Following this, we describe our method for 
ranking classifiers given their scores and results of statistical significance tests of 
differences between scores. Finally, we discuss the triangle testing method we use 
to test statistical significance of differences between score and rank correlations 
achieved by different score estimation methods. 

4.1 Data 

Table 1 describes the four datasets used in our experiments (D1-D4). Data are 
drawn from the TREC 2011 Crowdsourcing Track, which comprised two 
distinct tasks. Task 1 involved collecting crowd judgments, while Task 2 involved 
aggregating crowd judgments to classify each example. K = 10 teams participating 
in Task 2 classified M = 16,758 examples, yielding the 10 classifiers to be evaluated 
and their outputs (D1). 
Expert judgments used come from two distinct sources. For development and 
tuning, we use a set of 1,865 expert labels produced by the University of Water- 
loo [29] (D3). For final testing, we use a balanced set of 1,000 NIST judgments 
(D4) on which classifiers were officially evaluated in the track. 

Crowd judgments used (D2) are drawn from Task 1 submissions. Each of 2,715 
examples was labeled by ≈ 4-5 teams, collapsed to a single label by majority vote. 



4.2 Metrics for Classifier Performance 

To assess generality of methods across a range of potential metrics of interest, 
classifiers are evaluated on four common metrics, defined below, where tp is the 
number of true positive classifications, fp false positives, tn true negatives, and fn 
false negatives: 

$$\text{Accuracy (ACC)} = \frac{tp + tn}{tp + tn + fp + fn} \qquad (8)$$

$$\text{Precision (PRE)} = \frac{tp}{tp + fp} \qquad (9)$$

$$\text{Recall (REC)} = \frac{tp}{tp + fn} \qquad (10)$$

$$\text{Specificity (SPE)} = \frac{tn}{tn + fp} \qquad (11)$$


While many other classifier metrics could have also been studied, these four metrics 
represent a fair sample of potential metrics of interest for classification performed 
under varying operational settings. Statistical significance of observed differences 
is measured via a two-tailed, paired t-test. 
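The four metrics follow directly from the confusion counts, as in this small sketch. The function name `classifier_metrics` is hypothetical, and returning NaN for an undefined ratio (zero denominator) is an assumption made here.

```python
from typing import Dict, Sequence


def classifier_metrics(pred: Sequence[int], gold: Sequence[int]) -> Dict[str, float]:
    """Compute ACC, PRE, REC and SPE for binary labels (1 = relevant)."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    tn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)

    def ratio(num: int, den: int) -> float:
        # Assumed convention: an undefined ratio is reported as NaN.
        return num / den if den else float("nan")

    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "PRE": ratio(tp, tp + fp),
        "REC": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
    }


# Hypothetical toy input: tp = 2, fp = 1, tn = 1, fn = 1.
result = classifier_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(result)
```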



4.3 Ranking Classifiers For Each Metric 

Because not all differences in observed metric scores are statistically significant, 
it would be misleading to compare rank distinctions between classifiers whose 
score differences are not significant. This yields (potentially conflicting) pair-wise 
preference constraints from which we must then induce a ranking over classifiers. 
While optimal ordering is NP-Hard, simple heuristics exist [11] , and we allow and 
model tied ranks. There is also the question of loss function: should higher-ranked 
preference violations be penalized more heavily or should penalties be uniform at 
all ranks? 

To rank classifiers based on pair-wise preferences from significance testing, we 
use Copeland's method in which items are ordered by the number of pairwise 
victories minus pairwise defeats [25]. Suppose three classifiers $C_{A:C}$ achieve 
correlation $\tau_{A:C}$. Assume differences between $\tau_A$ and $\tau_B$, as well as $\tau_B$ and 
$\tau_C$, are not statistically significant, but the difference $\tau_A$ vs. $\tau_C$ is significant. 
The final order induced is thus $C_A > C_B > C_C$, since $C_A$ obtains one win (vs. $C_C$) 
and one tie (vs. $C_B$), $C_B$ obtains two ties, and $C_C$ obtains one tie and one loss 
(Copeland scores of 1, 0, and -1, respectively). We adopt this method largely 
for its simplicity, though the impact of ordering algorithm and loss function on 
our evaluation will be further studied in future work. 
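Copeland ordering from significance-test outcomes can be sketched as below. The input format (a list of `(winner, loser)` pairs for significant differences only) and the function name `copeland_ranking` are assumptions for illustration; items with equal Copeland scores share a tied rank.

```python
from typing import Dict, Iterable, List, Tuple


def copeland_ranking(items: List[str],
                     significant_wins: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Rank items by (pairwise victories - pairwise defeats); 1 is best.
    Pairs without a significant difference count as ties and add nothing."""
    score = {item: 0 for item in items}
    for winner, loser in significant_wins:
        score[winner] += 1
        score[loser] -= 1
    ordered = sorted(items, key=lambda item: -score[item])
    ranks: Dict[str, int] = {}
    for pos, item in enumerate(ordered):
        if pos > 0 and score[item] == score[ordered[pos - 1]]:
            ranks[item] = ranks[ordered[pos - 1]]  # tied with the previous item
        else:
            ranks[item] = pos + 1
    return ranks


# The three-classifier example from the text: only A vs. C is significant.
print(copeland_ranking(["A", "B", "C"], [("A", "C")]))
```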
4.4 Score and Rank Correlation Measures 

To compare the estimation methods introduced in Section 3 for each classification 
metric of potential interest, we measure correlation of estimated classifier scores 
and ranking of classifiers vs. actual scores and ranks according to expert judgments. 
We adopt standard correlation measures: Pearson's r for scores, and Spearman's ρ 
and Kendall's τ for rankings. As a further measure of rank correlation, we measure 
Voorhees' "swap" % [33], which estimates the probability of a discordant pair, i.e. 
the chance of any two classifiers being ranked in the wrong order. While closely 
related to Kendall's τ, this measure ignores ties and concordant pairs. Other rank 
correlation measures which reflect alternative loss functions [8,22,37] may be 
considered in future work. Note that score correlation is more sensitive than rank 
correlation to large errors in estimating performance on outliers (best and worst 
systems). For example, the worst system receives the same lowest rank whether it 
is 10x or 100x worse, yet its actual score may be harder to estimate. 

Statistical significance of rank correlation is typically concerned with determining 
if a pair of rankings is correlated, i.e. can we reject the null hypothesis $H_0$ that 
the two rankings are uncorrelated? For example, given 1) some estimated ranking 
over the classifiers and 2) the actual ranking as determined by evaluation on expert 
judgments, is there a significant correlation between the two rankings? In general, 
we are not interested in this question because some correlation nearly always ex- 
ists. While the degree of correlation is of interest, this is what the coefficient tells 
us directly. 

Instead, what we want to detect is when one estimation method achieves 
significantly greater correlation than some other method, involving triangle 
significance testing between three rankings: the ranking given by evaluation on expert 
judgments vs. the other two estimated rankings. How do we test the statistical 
significance of differences in correlation? We are not familiar with established IR 
methodology for this. Letting $r_{ab}$ denote the correlation coefficient between two 
sets of scores or rankings a and b, the null hypothesis $H_0$ supposes that two observed 
coefficients $r_{xy}$ and $r_{xz}$ are equivalent (where x denotes reference scores or 
rankings based on expert judgments, while y and z denote two sets of estimated scores 
or rankings). We compute the t statistic for triangle significance testing following 
Hotelling [19]: 

$$t = (r_{xy} - r_{xz}) \sqrt{\frac{(n-3)(1 + r_{yz})}{2\,\left(1 - r_{xy}^2 - r_{xz}^2 - r_{yz}^2 + 2\, r_{xy}\, r_{xz}\, r_{yz}\right)}}$$

with n-3 degrees of freedom and n being the triple ordered sample size, provided 
$|r| \neq 1$ for every coefficient. For example, with n = 120, $r_{xy}$ = 0.73, $r_{xz}$ = 0.61, 
and $r_{yz}$ = 0.66, we find that t = 2.378 with p = 0.02, providing strong evidence for 
rejecting $H_0$. 
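The t statistic can be checked numerically with a few lines of code. The function name `hotelling_t` is hypothetical; the sketch computes only the statistic, since the p-value additionally requires the t distribution with n-3 degrees of freedom, which is not part of the Python standard library.

```python
import math


def hotelling_t(r_xy: float, r_xz: float, r_yz: float, n: int) -> float:
    """t statistic for the difference between two dependent correlations
    r_xy and r_xz that share the reference variable x (n - 3 degrees of freedom)."""
    det = 1 - r_xy ** 2 - r_xz ** 2 - r_yz ** 2 + 2 * r_xy * r_xz * r_yz
    return (r_xy - r_xz) * math.sqrt((n - 3) * (1 + r_yz) / (2 * det))


# Worked example from the text.
t_value = hotelling_t(0.73, 0.61, 0.66, 120)
print(round(t_value, 3))
```

Swapping the two estimated rankings flips the sign of the statistic, so a two-tailed test treats both orderings symmetrically.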



5 Results 

Recall the goal of our study is to identify effective methods for evaluating classifiers 
when we have no expert labels, having either no labels at all (blind evaluation) 
or only crowd labels. In comparing alternative methods for estimating classifier 
scores, we seek to identify methods whose estimated scoring and ranking of clas- 
sifiers achieves high correlation with evaluation using expert labels. 
Table 2 Jaccard coefficient between relevant document sets identified by each D1 classifier 
(called "overlap" by Voorhees [34] and "Sys Similarity" by Aslam et al. [3]) (x̄ = .44, s = .21, 
max = .75, min = .02). 



We begin in Section 5.1 by showing variation across the label sets produced 
by each of the 10 classifiers. Similar to the earlier analysis of variation in assessor 
judgments presented by Voorhees [33], this analysis establishes the diversity 
of input label sets being used for evaluation. Next, Section 5.2 shows the actual 
scores and ranking of the 10 classifiers according to expert judgments (D3 and 
D4). Following this, Section 5.3 compares the score and rank correlation achieved 
by the various score & combine and combine & score estimation methods, in 
supervised and unsupervised settings, on the validation set. Analysis presented here 
informs our selection of the best performing unsupervised and supervised methods 
to evaluate on the test set (D4). Next, Section 5.4 presents correlation results on 
the test set. 

Additional analysis is presented in following sections. In Section 5.5 we analyze 
how sensitive our evaluation of classifiers is to the accuracy of the labels used (in 
place of expert judgments). Finally, Section 5.6 assesses accuracy of evaluation for 
outliers: the best and worst performing classifiers, whose outputs most differ from 
those of the other classifiers. 



5.1 Agreement Between Classifiers 

We begin by measuring overlap across the sets of relevant documents identified by 
the 10 different classifiers considered, providing a measure of their diversity. We 
note that Voorhees' "overlap" [34] is the Jaccard similarity coefficient computed 
between label sets (the size of the intersection over the size of their union). 
Pair-wise results are shown in Table 2. Overall, we observe overlap with mean x̄ = 0.442 
and standard deviation s = 0.205 across the classifier set, relatively similar to 
overlap levels seen in earlier studies with human assessors. Crucially, though, we 
observe that classifier C8 exhibits extremely low overlap vs. all other classifiers. 
At this point in discussion, we are not yet certain whether C8 is far better or far 
worse than the rest of the pack, though intuition suggests the latter is more likely. 

In addition to looking at overlap between classifiers, we also inspected Fleiss' κ, 
a widely used measure for annotator agreement between a fixed number of raters 
[16]. We observe κ = 0.27 between the 10 classifiers, representing low but "fair" 
(not chance) agreement. If we exclude the outlier C8, however, we observe far 
higher κ = 0.50 over the remaining 9 classifiers, indicating "moderate" agreement. 
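Both agreement measures used in this section are straightforward to compute from the classifier label sets. The sketch below uses the standard textbook definitions of the Jaccard coefficient and Fleiss' κ; the function names and the toy inputs are assumptions, and treating two empty relevant sets as identical (Jaccard of 1) is a convention chosen here.

```python
from typing import List, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Overlap between the sets of examples labeled relevant (1) by a and by b."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union if union else 1.0  # assumed convention for two empty sets


def fleiss_kappa(outputs: List[Sequence[int]], num_classes: int = 2) -> float:
    """Fleiss' kappa for n raters (classifiers) who each label the same m examples."""
    n, m = len(outputs), len(outputs[0])
    # counts[i][c] = number of raters assigning class c to example i
    counts = [[sum(out[i] == c for out in outputs) for c in range(num_classes)]
              for i in range(m)]
    # Mean observed agreement across examples.
    p_bar = sum(sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts) / m
    # Agreement expected by chance from the overall class proportions.
    p_e = sum((sum(row[c] for row in counts) / (m * n)) ** 2
              for c in range(num_classes))
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy inputs.
print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]))
print(fleiss_kappa([[1, 0, 1, 0]] * 3))
```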
x̄    0.64   0.52   0.67   0.63 
s    0.15   0.15   0.18   0.16 

Table 3 Actual performance of classifiers on validation set (D3) expert judgments for all four 
metrics considered (mean x and standard deviation s are also shown). Tied ranks according 
to statistical significance testing are indicated by * (95% confidence). 
x̄    0.62   0.58   0.74   0.50 
s    0.11   0.13   0.22   0.15 

Table 4 Actual performance of classifiers on test set (D4) expert judgments for all four 
metrics considered (mean x and standard deviation s are also shown). Tied ranks according 
to statistical significance testing are indicated by * (95% confidence). 



5.2 Actual Classifier Performance 

To measure our accuracy of estimating classifier performance, we must first 
compute the actual performance of classifiers according to expert judgments. We begin 
with the validation set (D3) and measure actual performance of the 10 classifiers. 
Results appear in Table 3. We see results on all four classification metrics, as 
well as sample mean x̄ and standard deviation s. Tied ranks according to a 
two-tailed paired t-test are indicated by * (use of *M vs. N* merely groups equivalent 
rank values vertically). Classifiers perform rather comparably, though classifier C8 
performs far worse than other classifiers for all metrics but specificity. 

Table 4 shows the actual performance achieved by all classifiers on the test set 
(D4) based on expert judgments. 



5.3 Correlation Results on Validation Set (D3) 

This section presents our validation set (D3) results, centered on the score and 
rank correlation achieved by each estimation method introduced in Section 3. The 
statistical significance of differences in correlation achieved by alternative methods 
is measured, with significance reported at 95% confidence or higher (see 
Section 4.4 for details). Analysis of validation set results provides our initial insights 
into research questions, including identification of best performing methods to be 
evaluated on the test set (D4, Section 5.4). 

Fig. 3 Validation set (D3) score correlation (Pearson's r) achieved by estimation methods 
from Section 3. The x-axis is labeled by method; the y-axis shows the correlation achieved. 
Results for each metric from Section 4.2 are connected by a different line. While correlation is 
high across methods for ACC, PRE, and REC metrics (r > 0.9 typically), SPE correlation is 
much lower (0.5 < r < 0.67). 

Summary of validation set findings. 1) Can we reliably estimate classi- 
fier performance? 2) Do crowd judgments enable us to do this significantly more 
effectively? We observe high Pearson score correlation of 0.9 or greater over the 
10 classifiers for three of the four classifier metrics considered. Crowd judgments 
are not seen to provide significant improvement. For rank correlation, however, we 
observe Spearman correlation between 0.87 and 0.96 with crowd judgments, but 
a lower 0.75 to 0.95 without judgments. 3) Which methods provide the best score 
and/or rank correlation for each labeled data condition and classifier metric of 
interest? Figures [3] and [4] answer this question via method vs. metric plots of raw
correlation levels, while Table [5] ranks the alternative methods based on statistical
significance testing of differences in correlation achieved.

Details. Figure [3] shows Pearson's r score correlation achieved by each method
from Section [3] for all four metrics: ACC, PRE, REC, and SPE. Correlation is quite
high across methods and metrics except for specificity, which is much lower. No 
significant benefit is seen from use of crowd judgments vs. blind evaluation, with 
comparable correlation achieved by the best performing methods in each class 
(direct evaluation and EM, respectively). 
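
Score correlation here is the ordinary Pearson coefficient between the scores an estimation method predicts for the classifiers and their actual scores under expert judgments. The sketch below illustrates that computation. The score lists are hypothetical placeholders, not values from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracy of five classifiers: actual (expert labels) vs. predicted
# (e.g. scored against pseudo-gold or crowd labels).
actual_acc = [0.82, 0.79, 0.75, 0.71, 0.55]
predicted_acc = [0.85, 0.80, 0.78, 0.70, 0.60]
print(round(pearson_r(actual_acc, predicted_acc), 3))
```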

(a) Spearman's ρ rank correlation on the validation set (D3)

(b) Kendall's τ rank correlation on the validation set (D3)

Fig. 4 Validation set (D3) rank correlation, Spearman's ρ (a) and Kendall's τ (b), achieved by
Section [3] methods. The x-axis is labeled by method. From left-to-right: the four blind methods:
MV and EM (C&S) followed by RR and Sampling (S&C); next the five supervised C&S
methods: SVM, NB, Ada, GLM, and Calibrated MV; finally crowd-based direct evaluation.
The y-axis shows the correlation achieved by each method. Results for each classifier metric,
ACC, PRE, REC, and SPE, are plotted by a different line.

Figure [4] shows rank correlation (Spearman's ρ (a) and Kendall's τ (b)) achieved
by the different methods from Section [3] on the validation set (D3). The plot for
Swap % rank correlation was quite similar to Kendall's τ, so we omit it to simplify
our presentation. Unlike score correlation, with rank correlation we do see
significant improvement from use of crowd judgments. For example, we observe
Spearman correlation between 0.87 and 0.96 with crowd judgments. Without judgments,
correlation achieved by blind methods spans a wider range, from around
0.75 to 0.95. With Kendall's τ, we see around 0.78-0.92 correlation across metrics
for direct evaluation (with crowd judgments), while blind EM ranges from 0.5-0.9
correlation across metrics.
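
The two rank correlation measures discussed here can be sketched as follows. This is an illustrative implementation under stated assumptions: Spearman's ρ is computed as Pearson's r over average ranks, and Kendall's τ is the simple tau-a variant in which tied pairs count as neither concordant nor discordant. How the paper treats ties is not specified in this section, so that handling is an assumption.

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical ranks of five classifiers: actual vs. predicted (one adjacent swap).
actual_rank = [1, 2, 3, 4, 5]
predicted_rank = [1, 3, 2, 4, 5]
print(spearman_rho(actual_rank, predicted_rank), kendall_tau_a(actual_rank, predicted_rank))
```

With a single adjacent swap among five items, τ drops further than ρ, which is consistent with the lower Kendall values reported relative to Spearman.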

MV+Calib(0.6)   2 1 1 5   8 2 1 2   8 2 1 5   8 2 1 5
Direct-eval     1 1 1 1   1 1 5 1   1 1 6 1   1 1 1 1

Table 5 Validation set (D3) results show relative correlation achieved by alternative estimation
methods introduced in Section [3]. Correlation is measured between predicted scores and
derivative ranking of classifiers, vs. actual scores and derivative ranking based on expert
judgments. Note that ranks shown refer to estimation methods, not the classifiers, indicating the
relative correlation achieved by alternative methods. A higher ranked method achieves statistically
significantly better correlation, with tied ranks indicating equivalent correlation. Ranks
reflect triangle significance testing of correlation at 95% confidence (see Section [4.4]).

Table [5] ranks methods (with ties) for all four classifier metrics according to the degree
of correlation achieved with evaluation on expert judgments in the validation set
(D3). In comparison to raw correlation values shown in Figures [3] and [4], method
rankings in Table [5] reflect statistical significance testing, i.e. a higher ranked
method achieves statistically significantly better correlation, with tied ranks indicating
equivalent correlation. The best method using crowd judgments, direct
evaluation, is shown in the bottom row. Without judgments, sampling achieves
lower correlation than direct evaluation, particularly for the SPE metric, as noted
earlier. Based on these results, we also select two combine & score methods to
evaluate in final testing: EM as a blind method and NB as a supervised method.
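
To make the significance testing behind these method ranks concrete, the sketch below computes Hotelling's t statistic for two dependent correlations that share one variable (here, the actual scores), in the form discussed by Steiger. This is offered under the assumption that it matches the triangle test of Section 4.4, which is not reproduced here. The correlation values are hypothetical.

```python
import math


def hotelling_t(r_xy, r_xz, r_yz, n):
    """t statistic (n - 3 degrees of freedom) for H0: corr(x, y) == corr(x, z).

    x: actual scores; y and z: scores predicted by two estimation methods,
    all measured over the same n classifiers.
    """
    # Determinant of the 3x3 correlation matrix of (x, y, z).
    det = 1 - r_xy ** 2 - r_xz ** 2 - r_yz ** 2 + 2 * r_xy * r_xz * r_yz
    if n <= 3 or det <= 0:
        raise ValueError("need n > 3 and a positive-definite correlation matrix")
    return (r_xy - r_xz) * math.sqrt((n - 3) * (1 + r_yz) / (2 * det))


# Two-tailed 5% critical value of Student's t with 7 degrees of freedom (n = 10).
T_CRIT_95_DF7 = 2.365

# Hypothetical correlations: method A vs. actual, method B vs. actual, A vs. B.
t_stat = hotelling_t(0.9, 0.7, 0.8, n=10)
significant = abs(t_stat) > T_CRIT_95_DF7
print(round(t_stat, 3), significant)
```

With only 10 classifiers, even a gap of 0.2 between two correlations may fail to reach 95% significance, which is one reason many methods end up with tied ranks.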



5.4 Correlation Results on Test Set (D4)

11.1* 8.9 11.1 22.2
13.3 2.2† 6.7 20.0*



Table 6 Test set results (D4): correlation of alternative prediction methods vs. actual
scores and ranks from NIST judgments. Blind methods use no labeled data while non-blind
methods make use of crowd labels (D2). C&S denotes a "Combine & Score" approach while
S&C denotes a "Score & Combine" approach. Direct denotes evaluating classifiers directly
on crowd labels (D2). Statistical significance of correlation differences between EM and other
prediction methods is indicated by * (95% confidence) and † (99% confidence).



Informed by validation set results (Section [5.3]), we selected a top performing
method from each method group considered and evaluated these methods on NIST
gold judgments (D4). For blind evaluation with combine & score, EM is reported.
For supervised combine & score, we report NB. Sampling is reported for score &
combine. In the crowd label (D2) condition, we again compare use of these labels
for supervised combine & score vs. simple direct evaluation.

As shown in Table [6], most methods show strong score correlation with evaluation
based on NIST judgments. For rank correlation, sampling, NB, and
direct evaluation all achieve significantly higher correlation than EM. In the
absence of expert judgments, score & combine methods show better correlation
than unsupervised combine & score methods, consistent with earlier results
on the validation set. With crowd judgments, direct evaluation achieved significantly
higher correlation than NB on both PRE and SPE for Kendall's τ rank
correlation.



5.5 Quality of Pseudo-Gold vs. Correlation 

The combine & score approach offers a potentially valuable separation of concerns: 
let machine learning research tackle the label aggregation problem [26], then use 
the output "pseudo-gold" as if it were expert gold. Intuitively, the more accurate 
the pseudo-gold, the better this strategy can be expected to work. A key concern,
however, is how robust evaluation on pseudo-gold will be to impurity (e.g. labeling
errors). To investigate this, we analyze the relationship between: a) labeling quality
of alternative combine & score methods, vs. b) derivative correlation between
predicted vs. actual scores and ranks of classifiers. We quantify the relationship
between label quality and correlation by computing another, secondary Pearson
r correlation between label quality and each of the primary correlations in evaluation
accuracy achieved by the estimation method (the four different correlation
measures discussed thus far).

For each classification metric (ACC, PRE, REC, SPE), we compute the secondary
Pearson r correlation between the label quality vs. primary correlation
over the 7 different combine & score methods considered. Table [8] shows the results
on the test set (D4). Table [8]'s caption further details our process to generate
these results. Note that Swap % values in Table [8] are negative since it is a measure
of error while performance metrics measure quality, hence they are inversely
correlated, as the negative values indicate.
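
The regrouping and secondary correlation described above can be sketched as follows for a single metric. The label quality values are the REC column of Table 7. The primary correlation values are hypothetical placeholders, since per-method primary correlations are not listed in this section, and the function names are illustrative only.

```python
import math


def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Label quality (REC) of the 7 combine & score methods, from Table 7.
label_quality = {
    "MV": 0.864, "EM": 0.752, "MV+Calib(0.6)": 0.752, "Ada": 0.708,
    "NB": 0.790, "GLM": 0.774, "SVM": 0.762,
}

# HYPOTHETICAL primary correlations (predicted vs. actual REC) for each method.
primary_correlation = {
    "MV": {"pearson": 0.97, "spearman": 0.93, "kendall": 0.85},
    "EM": {"pearson": 0.94, "spearman": 0.88, "kendall": 0.78},
    "MV+Calib(0.6)": {"pearson": 0.95, "spearman": 0.90, "kendall": 0.80},
    "Ada": {"pearson": 0.91, "spearman": 0.86, "kendall": 0.74},
    "NB": {"pearson": 0.96, "spearman": 0.89, "kendall": 0.82},
    "GLM": {"pearson": 0.93, "spearman": 0.91, "kendall": 0.79},
    "SVM": {"pearson": 0.95, "spearman": 0.87, "kendall": 0.77},
}


def secondary_correlations(quality, primary):
    """For each correlation measure, correlate label quality with the primary
    correlation achieved, over all methods (one value per method)."""
    methods = sorted(quality)
    measures = sorted(next(iter(primary.values())))
    quality_values = [quality[m] for m in methods]
    return {
        measure: pearson_r(quality_values, [primary[m][measure] for m in methods])
        for measure in measures
    }


print(secondary_correlations(label_quality, primary_correlation))
```

Repeating this for each of the four metrics would give one row of secondary correlations per metric, which is the layout of Table 8.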



Method          ACC    PRE    REC    SPE
MV              0.691  0.642  0.864  0.518
EM              0.692  0.671  0.752  0.632
MV+Calib(0.6)   0.692  0.671  0.752  0.632
Ada             0.691  0.666  0.708  0.614
NB              0.692  0.661  0.790  0.594
GLM             0.690  0.663  0.774  0.606
SVM             0.676  0.651  0.762  0.590

Table 7 Test set (D4) quality of pseudo-gold labels output by unsupervised and supervised
combine & score methods (Rows 1-2 vs. Rows 3-7, respectively).



Results shown in Figure [5] empirically validate our intuition: the performance of
the combine & score method (i.e. the quality of pseudo-gold it produces) is strongly
correlated with how effectively we can predict classifier scores and correctly rank
classifiers. The y-axis in the figure shows the quality of each pseudo-gold label set. The
x-axis shows the correlation coefficient achieved when predicting classification performance
with that pseudo-gold. The regression lines show correlation strength, which
is 0.71 (blue) for Pearson's r, 0.56 for Kendall's τ, and 0.57 for Spearman's ρ. As
expected, higher quality pseudo-gold achieves a higher correlation
coefficient, demonstrating a strong correlation between pseudo-gold
quality and its accuracy of classifier performance prediction. Consequently,
we conclude that it is critical to generate highly accurate pseudo-gold labels for
accurately evaluating classifier performance in the absence of ground truth.

Figure 5 plot: pseudo-gold quality (y-axis) vs. correlation coefficient (x-axis), with
regression lines for Pearson's r, Kendall's tau, and Spearman's rho.

Fig. 5 How the quality of pseudo-gold labels produced by combine & score methods impacts
effectiveness in predicting classifier scores and rankings. Strength of correlation is measured
by Pearson's r between the quality of pseudo-gold and the correlation coefficient achieved by the
pseudo-gold.

Metric   Pearson r   Spearman ρ   Kendall τ   Swap %
ACC      0.34        0.38         0.22        -0.24
PRE      0.32        0.28         0.32        -0.30
REC      0.71        0.57         0.64        -0.65
SPE      0.62        0.66         0.69        -0.49

Table 8 Combine & Score methods generate pseudo-gold for evaluating classifiers. How robust
is evaluation to errors in the pseudo-gold? Table [7] shows the quality of pseudo-gold labels
output by the 7 methods for each metric. Evaluating classifiers on pseudo-gold vs. expert
judgments yields 4 correlation values per method for each metric. Re-grouping these values
yields 7 correlation values per correlation measure for each metric. Finally, we compute Pearson
r correlation between the 7 label quality scores vs. the 7 correlation values for each metric.
Results are shown in this Table for the test set (D4). Note that Swap % is an error measure,
so is inversely related to label quality (smaller r values are better).

As expected, the general trend shows label quality is correlated with
how effectively we can predict classifier scores and derivative ranks. The ACC
and PRE differences among the 7 pseudo-gold label sets are too small to yield strong
correlations. However, the REC and SPE differences among the pseudo-gold
label sets are reasonably large, and so the correlations are stronger than for the
former two metrics.
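
The difference in spread can be checked directly from the Table 7 values. The short sketch below computes the range (maximum minus minimum) of pseudo-gold quality across the 7 methods for each metric. Only the variable names are assumptions.

```python
# Pseudo-gold quality per method, copied from Table 7 (ACC, PRE, REC, SPE).
table7 = {
    "MV": (0.691, 0.642, 0.864, 0.518),
    "EM": (0.692, 0.671, 0.752, 0.632),
    "MV+Calib(0.6)": (0.692, 0.671, 0.752, 0.632),
    "Ada": (0.691, 0.666, 0.708, 0.614),
    "NB": (0.692, 0.661, 0.790, 0.594),
    "GLM": (0.690, 0.663, 0.774, 0.606),
    "SVM": (0.676, 0.651, 0.762, 0.590),
}
metric_names = ("ACC", "PRE", "REC", "SPE")

# Range of label quality across the 7 methods, per metric.
quality_range = {}
for column, metric in enumerate(metric_names):
    values = [row[column] for row in table7.values()]
    quality_range[metric] = round(max(values) - min(values), 3)

print(quality_range)
# {'ACC': 0.016, 'PRE': 0.029, 'REC': 0.156, 'SPE': 0.114}
```

The ACC and PRE ranges are under 0.03, while the REC and SPE ranges exceed 0.11, so the secondary correlation has far more variation in label quality to work with for the latter two metrics.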



5.6 Detecting Outliers in Blind Evaluation 



A predominant concern in prior blind evaluation studies has been difficulty predicting
performance of the best systems, since the uniqueness which distinguishes
them is not "affirmed" by other systems [3, 37]. Moreover, success of blind evaluation
is likely limited by overall quality of the system pool: if all of the systems are
terrible, we cannot expect to derive much value from their outputs.





Actual Rank   Predicted Rank (ACC*  PRE  REC  SPE)
1 (best)      1 4 14
2             1 1 2 10
3             6 13 3
10 (worst)    10 10 10 9



Table 9 Rows 1-3: EM predicted ranks for the top-3 classifiers per metric according to NIST 
judgments (D4). Note that the top-3 classifiers for ACC actually tied for rank 1. Row 4 shows 
EM predicted rank for the worst classifier based on NIST judgments. 

This section focuses exclusively on blind evaluation, analyzing how well we are 
able to predict performance of the best and worst classifiers when no judgments are 
available. We begin by analyzing our ability to accurately rank the top-3 classifiers 
according to NIST judgments (test set D4). At the other extreme, we analyze 
effectiveness of identifying the worst performer, classifier C8. Refer to Table [4] for 
actual performance achieved by all classifiers on the test set. As a representative
estimation method for blind evaluation, we arbitrarily select EM (Section [3.2.2])
and compare its predicted scores and ranks to the actual scores and ranks.

Table [9] presents our results. Rows 1-3 show the predicted ranks for the top 3 
classifiers (according to actual evaluation on NIST judgments). For example, we 
see the best classifier for PRE is only predicted at rank 5 by EM, while the 2nd- 
ranked classifier is predicted at rank 1. Note that for ACC, the top-3 classifiers 
are actually tied (equivalent). 

Can the best classifiers be accurately predicted with blind evaluation? Results 
here indicate the answer depends on the classification metric of interest. For REC, 
the EM predicted ranking perfectly matches the actual ranking. With ACC, the 
top-2 systems are correctly predicted but the 3rd system (tied with the others) is 
mistakenly predicted at rank 6. For PRE, the 2nd and 3rd best systems are predicted
nearly correctly, while the best system is predicted at rank 5 (noted earlier). SPE is by
far seen to be the most difficult metric to predict accurately, consistent with our
earlier findings for all systems on the validation set (Figures [3] and [4]).

Can the worst classifier be accurately detected with blind evaluation? Row 4 
in Table [9] shows predicted vs. actual rank of the worst performing classifier for 
each metric (C8 for all metrics but SPE). For ACC, PRE, and REC, the predicted 
rank matches the actual rank, while for SPE the predicted rank was off by one. 
Therefore, our ability to identify the worst system is fairly consistent with our
ability to identify the best systems, in that both depend on the metric of interest.
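
The outlier analysis above amounts to looking up the predicted rank of the classifiers that are actually best and worst. A minimal sketch of that lookup follows. It uses plain ordinal ranking by score, which is a simplification: the paper ties ranks via significance testing. The scores and classifier names are hypothetical.

```python
def ordinal_ranks(scores):
    """Map classifier name -> rank (1 = best) by descending score; ties broken by name."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def predicted_rank_of_extremes(actual, predicted, top_k=3):
    """Return (actual rank, classifier, predicted rank) for the top-k and worst classifiers."""
    actual_rank = ordinal_ranks(actual)
    predicted_rank = ordinal_ranks(predicted)
    by_actual = sorted(actual_rank, key=actual_rank.get)
    top = [(actual_rank[c], c, predicted_rank[c]) for c in by_actual[:top_k]]
    worst = by_actual[-1]
    return top, (actual_rank[worst], worst, predicted_rank[worst])


# Hypothetical scores for five classifiers under one metric.
actual_scores = {"C1": 0.80, "C2": 0.75, "C3": 0.70, "C4": 0.65, "C5": 0.40}
blind_scores = {"C1": 0.78, "C2": 0.79, "C3": 0.60, "C4": 0.66, "C5": 0.35}

top3, worst = predicted_rank_of_extremes(actual_scores, blind_scores)
print(top3)   # [(1, 'C1', 2), (2, 'C2', 1), (3, 'C3', 4)]
print(worst)  # (5, 'C5', 5)
```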

6 Conclusion 

This paper reported our investigation into methods for evaluating classifiers us- 
ing either no judgments (blind evaluation) or crowd judgments. We pursued two 
general strategies. Combine & Score methods aggregated classifier outputs into a 
single pseudo-gold judgment set on which classifiers were scored. Score & Combine
methods scored classifiers on multiple sets of judgments sampled from classifier
outputs, then averaged performance across judgment sets. In the case of crowd
judgments being available, we explored two approaches for exploiting them: either
direct evaluation of classifiers or supervising combine & score methods. Classifiers
were scored on four standard metrics and then ranked based on statistically
significant differences. Experiments were conducted with 10 classifiers developed 
independently by teams participating in the TREC 2011 Crowdsourcing Track. 

To evaluate our methods, we reported score and rank correlation measures 
comparing actual classifier performance vs. predicted performance. To correctly
interpret differences in correlation observed, we utilized the triangle testing
approach [19] (Section [4.4]), which was originally proposed by Hotelling and
investigated by Steiger [33]; we were not familiar with established methodology
in the IR community for such testing.

In regard to research questions (Section [1]), results showed high score correlation
for three of the four classifier metrics considered. While crowd judgments
were not seen to provide significant improvement for score correlation, they did 
significantly benefit rank correlation. When crowd judgments were available, we 
found that direct evaluation on them outperformed their use to supervise combine 
& score methods. In the blind evaluation case, simple round-robin evaluation was 
typically as effective as more complicated EM, but significantly outperformed the 
more popular MV approach. As expected, lower quality of labels output by com- 
bine & score did yield less accurate evaluation, though evaluation was reasonably 
tolerant of some amount of label noise, particularly in regard to ranking classifiers 
based on accuracy. Finally, blind evaluation for outliers was surprisingly accurate, 
though imperfections still remain; there is no silver bullet here, but a fairly ef- 
fective approximation of evaluation based on expert judgments. Crowd judgments 
can further improve this approximation at some additional cost, but likely still 
cheaper than collecting expert judgments. 

Future work will perform comparative studies across other datasets to better 
assess robustness of findings, consider alternative methods for inducing a ranking
from system scores (and significance tests), as well as impact of other rank
correlation measures [8, 22, 37].